Parallelization frameworks have recently become a necessity for speeding up the training of deep neural networks (DNNs). Such frameworks typically employ the Model Average approach, denoted as MA-DNN, in which parallel workers conduct their respective training on their own local data, while the parameters of the local models are periodically communicated and averaged to obtain a global model that serves as the new starting point for the local models. However, since a DNN is a highly non-convex model, averaging parameters cannot guarantee that such a global model performs better than the local models. To tackle this problem, we introduce a new parallel training framework called Ensemble-Compression, denoted as EC-DNN. In this framework, we propose to aggregate the local models by ensemble, i.e., by averaging the outputs of the local models instead of their parameters. As most prevalent loss functions are convex with respect to the output of a DNN, the performance of the ensemble-based global model is guaranteed to be at least as good as the average performance of the local models. However, a major challenge lies in the explosion of model size, since each round of ensembling multiplies the size of the model. Thus, we carry out model compression after each ensemble step, realized in this paper by a distillation-based method, to reduce the size of the global model to that of the local ones. Our experimental results demonstrate the clear advantage of EC-DNN over MA-DNN in terms of both accuracy and speedup.
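The guarantee for the ensemble-based global model follows from Jensen's inequality. As a minimal sketch (the symbols K, f_k, and \ell are our own notation for the number of local models, the k-th local model, and a loss function convex in the model output, and are not taken from the abstract), the loss of the averaged output is bounded by the average loss of the local models:

\[
\ell\!\left(\frac{1}{K}\sum_{k=1}^{K} f_k(x),\; y\right)
\;\le\;
\frac{1}{K}\sum_{k=1}^{K} \ell\!\left(f_k(x),\; y\right).
\]

No analogous bound holds for averaging parameters, since the loss is highly non-convex in the DNN parameters.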